Introduction to Open Data Science - Course Project

Chapter 1 - Warmup

Feelings and expectations

I’m quite excited to participate in this course. I’ve already used R and R Markdown before but I know my way of coding is not efficient or clean. This is why I expect to learn better methods to use R, R Markdown and other tools introduced during the next month or so.

My PhD supervisor told me about this course because she thought I could greatly benefit from this.

Trying this for fun

# I want to print out today's date

date()
## [1] "Tue Apr 13 14:24:36 2021"

I did it. Hooray!

Creating a list

Let’s try this out.

  • First point
  • Second point
  • Third point

Chapter 2 - Data wrangling and analysis

Introducing the data

First we read the students2014 data, which is available online. After that, we take a look at what the data consist of.

# Download the data

students2014 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", sep = ",", header = TRUE)

# First look at the data

head(students2014)
##   gender age attitude     deep  stra     surf points
## 1      F  53      3.7 3.583333 3.375 2.583333     25
## 2      M  55      3.1 2.916667 2.750 3.166667     12
## 3      F  49      2.5 3.500000 3.625 2.250000     24
## 4      M  53      3.5 3.500000 3.125 2.250000     10
## 5      M  49      3.7 3.666667 3.625 2.833333     22
## 6      F  38      3.8 4.750000 3.625 2.416667     21
# Look at the dimensions and structure of the data

dim(students2014)
## [1] 166   7
str(students2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

So the data include:

  • demographic information about the students (age, gender)
  • the students’ attitude scores towards statistics, based on a questionnaire (attitude)
  • student exam points (points)
  • student scores on deep learning, based on a questionnaire (deep)
  • student scores on strategic learning, based on a questionnaire (stra)
  • student scores on surface-level learning, based on a questionnaire (surf)

The data frame has 166 observations (rows) and 7 variables (columns). Looking at the structure, most variables are numeric; two are integer-valued (age, points) and one is a factor (gender).

Let’s look at a summary of the observations.

summary(students2014)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

Most of the students are women, and most are in their 20s; the oldest student is 55 years old. Based on the summary, the attitude, deep, strategic and surface scores are on a 1–5 scale. The highest exam score was 33 points, and more than half of the scores were above 20.

Graphical overview of the data

Let’s plot the whole dataset to get a general idea of how the variables behave in pairs.

plot(students2014[-1], col = students2014$gender)

This plot can be used to spot possible trends. We are interested in discovering which variables contribute to high or low exam points, so we’ll look at the last row or at the rightmost column, since those panels pair exam points with every other variable. For example, in the second column’s (attitude) last row (points) we see some form of correlation: the data points skew from the lower left corner to the upper right corner, like a trend line, giving us a clue that exam points tend to be high (or low) when the attitude scores are high (or low). A similar trend might be seen in the fifth column’s fourth row (stra).
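The visual impression can be checked numerically. A minimal sketch in base R, assuming students2014 has been read in as above:

```r
# Pearson correlation between attitude and exam points;
# a value clearly above zero supports the upward trend seen in the plot
cor(students2014$attitude, students2014$points)
```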

We can use the ggplot2 and GGally packages to generate a more advanced plot. If you’re not using Linux, disregard the “upper” part inside the ggpairs command; it’s a workaround for font issues GGally has on Ubuntu.

library(ggplot2)
library(GGally)

plot <- ggpairs(students2014, mapping = aes(col = gender, alpha = 0.3),
                upper = list(continuous = "cor_v1_5"),
                lower = list(combo = wrap("facethist", bins = 20)))

plot

We get some additional information out of this plot. It is important to look at the correlation values (Corr): the higher the absolute value, the stronger the correlation between the variables. We observe that points and attitude have a positive correlation of 0.437, the highest in this dataset. Other correlations worth noting are the strategic (stra) and surface (surf) scores, whose correlations with points are both roughly 0.14 in absolute value; the surface correlation is negative, so surface scores correlate negatively with points.

Regression model

Since attitude, surf and stra are most correlated with points, we’ll fit a linear model on these variables. Our variable of interest is points (y) and our explanatory variables (x1, x2, x3) are attitude, stra and surf.

model <- lm(points ~ attitude + surf + stra, data = students2014)

summary(model)
## 
## Call:
## lm(formula = points ~ attitude + surf + stra, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## surf         -0.5861     0.8014  -0.731  0.46563    
## stra          0.8531     0.5416   1.575  0.11716    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

Looking at the explanatory variables, we see from the summary that only attitude is significantly associated with points (p < 0.001); surf and stra are not. We’ll refit the model with attitude only, since surf and stra were unnecessarily included.

model <- lm(points ~ attitude, data = students2014)

summary(model)
## 
## Call:
## lm(formula = points ~ attitude, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The p value of attitude became slightly smaller (more significant) compared to our previous model. Additionally, we observe that attitude has a positive relationship with points, since the estimate is greater than zero. The intercept is the point where the regression line crosses the y axis: when x (attitude) is zero, y (points) is 11.6372. The intercept is also statistically significant (p < 0.001).
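To make the coefficients concrete, we can plug a few attitude values into the fitted line (a quick sketch, assuming `model` is the attitude-only fit above):

```r
# Predicted exam points at attitude scores 2, 3 and 4,
# i.e. 11.6372 + 3.5255 * attitude
predict(model, newdata = data.frame(attitude = c(2, 3, 4)))
```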

Model diagnostics

To confirm our model works for our data, we’ll run some diagnostics. Let’s generate the plots first.

par(mfrow = c(2,2))
plot(model, which = c(1,2,5))

The linear model assumes the residuals are normally distributed with constant variance and that they are uncorrelated. From the Residuals vs. Fitted plot we observe that the points scatter quite randomly. If, for example, the points were tightly packed on the left side but spread out on the right, we would have a problem.

The Q-Q plot can be used to check the normality assumption: the closer the points fall to the line, the better the fit. Some points at the tails stray, but overall the points follow the dashed line quite nicely.

The Residuals vs. Leverage plot helps to identify influential outliers. In our plot there are no identifiable outliers: the points are fairly close together and the leverage values on the x axis stay small. If the x axis values were larger and a single point sat far away from the rest, we would have to reconsider our model.

Overall, our diagnostics show that the linear model is a reasonable fit for the data.
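As an optional, more formal complement to the visual checks (an extra step, not part of the exercise), a Shapiro–Wilk test could be run on the residuals:

```r
# Null hypothesis: the residuals are normally distributed;
# a large p value means we cannot reject normality
shapiro.test(resid(model))
```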


Chapter 3 - Logistic regression

Let’s download the data we’re using for this exercise and explore the structure.

student_alc <- read.table("pormath.txt")

dim(student_alc)
## [1] 370  51
str(student_alc)
## 'data.frame':    370 obs. of  51 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ age       : int  15 15 15 15 15 15 15 15 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 1 1 1 1 1 1 1 1 1 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 1 1 1 1 1 2 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 2 2 2 2 2 2 2 2 2 1 ...
##  $ Medu      : int  1 1 2 2 3 3 3 2 3 3 ...
##  $ Fedu      : int  1 1 2 4 3 4 4 2 1 3 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 3 1 4 4 4 4 2 3 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 3 3 3 2 4 2 5 4 3 2 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 2 4 4 1 4 1 1 4 4 4 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 2 2 2 3 2 1 2 1 1 ...
##  $ traveltime: int  2 1 1 1 2 1 2 2 2 1 ...
##  $ studytime : int  4 2 1 3 3 3 3 2 4 4 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 1 2 1 2 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 1 ...
##  $ activities: Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 1 1 1 1 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 1 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 2 2 2 1 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 2 1 1 1 ...
##  $ famrel    : int  3 3 4 4 4 4 4 4 4 4 ...
##  $ freetime  : int  1 3 3 3 2 3 2 1 4 3 ...
##  $ goout     : int  2 4 1 2 1 2 2 3 2 3 ...
##  $ Dalc      : int  1 2 1 1 2 1 2 1 2 1 ...
##  $ Walc      : int  1 4 1 1 3 1 2 3 3 1 ...
##  $ health    : int  1 5 2 5 3 5 5 4 3 4 ...
##  $ n         : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ id.p      : int  1096 1073 1040 1025 1166 1039 1131 1069 1070 1106 ...
##  $ id.m      : int  2096 2073 2040 2025 2153 2039 2131 2069 2070 2106 ...
##  $ failures  : int  0 1 0 0 1 0 1 0 0 0 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 2 1 1 1 2 1 1 1 1 1 ...
##  $ absences  : int  3 2 8 2 5 2 0 1 9 10 ...
##  $ G1        : int  10 10 14 10 12 12 11 10 16 10 ...
##  $ G2        : int  12 8 13 10 12 12 6 10 16 10 ...
##  $ G3        : int  12 8 12 9 12 12 6 10 16 10 ...
##  $ failures.p: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ paid.p    : Factor w/ 2 levels "no","yes": 2 1 1 1 2 1 1 1 1 1 ...
##  $ absences.p: int  4 2 8 2 2 2 0 0 6 10 ...
##  $ G1.p      : int  13 13 14 10 13 11 10 11 15 10 ...
##  $ G2.p      : int  13 11 13 11 13 12 11 10 15 10 ...
##  $ G3.p      : int  13 11 12 10 13 12 12 11 15 10 ...
##  $ failures.m: int  1 2 0 0 2 0 2 0 0 0 ...
##  $ paid.m    : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 1 2 1 1 ...
##  $ absences.m: int  2 2 8 2 8 2 0 2 12 10 ...
##  $ G1.m      : int  7 8 14 10 10 12 12 8 16 10 ...
##  $ G2.m      : int  10 6 13 9 10 12 0 9 16 11 ...
##  $ G3.m      : int  10 5 13 8 10 11 0 8 16 11 ...
##  $ alc_use   : num  1 3 1 1 2.5 1 2 2 2.5 1 ...
##  $ high_use  : logi  FALSE TRUE FALSE FALSE TRUE FALSE ...
##  $ cid       : int  3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 ...
head(student_alc)
##   school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob     reason
## 1     GP   F  15       R     GT3       T    1    1  at_home    other       home
## 2     GP   F  15       R     GT3       T    1    1    other    other reputation
## 3     GP   F  15       R     GT3       T    2    2  at_home    other reputation
## 4     GP   F  15       R     GT3       T    2    4 services   health     course
## 5     GP   F  15       R     GT3       T    3    3 services services reputation
## 6     GP   F  15       R     GT3       T    3    4 services   health     course
##   guardian traveltime studytime schoolsup famsup activities nursery higher
## 1   mother          2         4       yes    yes        yes     yes    yes
## 2   mother          1         2       yes    yes         no      no    yes
## 3   mother          1         1       yes    yes        yes     yes    yes
## 4   mother          1         3       yes    yes        yes     yes    yes
## 5    other          2         3        no    yes        yes     yes    yes
## 6   mother          1         3       yes    yes        yes     yes    yes
##   internet romantic famrel freetime goout Dalc Walc health n id.p id.m failures
## 1      yes       no      3        1     2    1    1      1 2 1096 2096        0
## 2      yes      yes      3        3     4    2    4      5 2 1073 2073        1
## 3       no       no      4        3     1    1    1      2 2 1040 2040        0
## 4      yes       no      4        3     2    1    1      5 2 1025 2025        0
## 5      yes      yes      4        2     1    2    3      3 2 1166 2153        1
## 6      yes       no      4        3     2    1    1      5 2 1039 2039        0
##   paid absences G1 G2 G3 failures.p paid.p absences.p G1.p G2.p G3.p failures.m
## 1  yes        3 10 12 12          0    yes          4   13   13   13          1
## 2   no        2 10  8  8          0     no          2   13   11   11          2
## 3   no        8 14 13 12          0     no          8   14   13   12          0
## 4   no        2 10 10  9          0     no          2   10   11   10          0
## 5  yes        5 12 12 12          0    yes          2   13   13   13          2
## 6   no        2 12 12 12          0     no          2   11   12   12          0
##   paid.m absences.m G1.m G2.m G3.m alc_use high_use  cid
## 1    yes          2    7   10   10     1.0    FALSE 3001
## 2     no          2    8    6    5     3.0     TRUE 3002
## 3    yes          8   14   13   13     1.0    FALSE 3003
## 4    yes          2   10    9    8     1.0    FALSE 3004
## 5    yes          8   10   10   10     2.5     TRUE 3005
## 6    yes          2   12   12   11     1.0    FALSE 3006

The dataset includes information on secondary-education student performance in two Portuguese schools. We have 370 students (rows, or observations) and 51 attributes (columns, or variables). These variables include sex, grades (G1, G2, G3), age, school, and information about the students’ social and family background. Two school subjects were of interest: math and Portuguese language.

More details about the dataset can be found here: https://archive.ics.uci.edu/ml/datasets/Student+Performance

Overview of the data

In this exercise we want to explore the relationship between alcohol consumption and academic performance. The dataset was modified to include alc_use and high_use: the former holds the average weekly alcohol consumption, and the latter is TRUE when alc_use is greater than 2. We are only using the G3 grades, since that column holds the final grades. Note that the data were wrangled with Reijo Sund’s code, so the dataset differs from the DataCamp set.
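For reference, the alcohol variables were presumably derived roughly like this during the wrangling (a sketch, assuming Dalc and Walc are the workday and weekend consumption columns; the `_check` names are just illustrative):

```r
library(dplyr)

# alc_use: average of workday (Dalc) and weekend (Walc) consumption;
# high_use: TRUE when that average exceeds 2
student_alc %>%
  mutate(alc_use_check = (Dalc + Walc) / 2,
         high_use_check = alc_use_check > 2) %>%
  select(Dalc, Walc, alc_use_check, high_use_check) %>%
  head()
```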

Let’s look at high_use, absences, study time and failures, and whether there are differences between men and women. I hypothesise that high consumption results in lower grades, more absences, less study time and more failures. I also hypothesise that there are differences between the sexes.

First we’ll cross-tabulate. I excluded absences from cross-tabulation since the results were hard to interpret.

library(tidyr); library(dplyr); library(ggplot2)
student_alc %>% group_by(failures, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
## # A tibble: 8 x 4
## # Groups:   failures [4]
##   failures high_use count mean_grade
##      <int> <lgl>    <int>      <dbl>
## 1        0 FALSE      238      12.2 
## 2        0 TRUE        87      11.5 
## 3        1 FALSE       12       8.08
## 4        1 TRUE        12       9.25
## 5        2 FALSE        8       7   
## 6        2 TRUE         9       8.33
## 7        3 FALSE        1      10   
## 8        3 TRUE         3       7.33
student_alc %>% group_by(studytime, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
## # A tibble: 8 x 4
## # Groups:   studytime [4]
##   studytime high_use count mean_grade
##       <int> <lgl>    <int>      <dbl>
## 1         1 FALSE       56       11  
## 2         1 TRUE        42       10.4
## 3         2 FALSE      128       11.8
## 4         2 TRUE        57       10.8
## 5         3 FALSE       52       12.4
## 6         3 TRUE         8       13  
## 7         4 FALSE       23       12.3
## 8         4 TRUE         4       12.5
student_alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
## # A tibble: 4 x 4
## # Groups:   sex [2]
##   sex   high_use count mean_grade
##   <fct> <lgl>    <int>      <dbl>
## 1 F     FALSE      154       11.4
## 2 F     TRUE        41       11.8
## 3 M     FALSE      105       12.3
## 4 M     TRUE        70       10.3

We see that zero failures combined with low alcohol consumption (high_use FALSE) is associated with higher grades, while zero failures with high consumption comes with a lower mean grade. The relationship is less straightforward with more failures. Most students’ study time falls into category 2, where low alcohol use is associated with better grades, but we also see high consumers with longer study times (3–4) achieving slightly higher grades. Finally, men with high alcohol consumption seem to have lower grades.

Now for some plots, in which we separate by sex.

ggplot(student_alc, aes(x = high_use, y = G3, col = sex))+ geom_boxplot() + ggtitle("Student grades by alcohol consumption and sex")

ggplot(student_alc, aes(x = high_use, y = absences, col = sex))+ geom_boxplot() + ggtitle("Student absences by alcohol consumption and sex")

ggplot(student_alc, aes(x=studytime, fill = sex)) + geom_bar() + ggtitle("Study time by alcohol consumption and sex") + facet_wrap(~high_use)

ggplot(student_alc, aes(x=failures, fill = sex)) + geom_bar() + ggtitle("Failures by alcohol consumption and sex") + facet_wrap(~high_use)

From these plots we see that men’s grades are slightly lower when alcohol consumption is high. Absences seem to be slightly more common with high alcohol consumption for both sexes. The relationship between consumption and study time is again not so straightforward. Failures, on the other hand, are more common in men when alcohol consumption is high. However, it’s good to bear in mind that the total number of high consumers is lower than that of low consumers. My initial hypothesis was not too far off, but I’m surprised how different the results are between men and women.

Logistic regression

We’ll run a generalised linear model with binomial distribution. We’ll explore the relationship of high alcohol use to the previously discussed variables failures, absences, sex, study time and final year grades (G3).

# Generalised linear model
model <- glm(high_use ~ failures + absences + sex + studytime + G3, data = student_alc, family = "binomial")

# Summary of the model
summary(model)
## 
## Call:
## glm(formula = high_use ~ failures + absences + sex + studytime + 
##     G3, family = "binomial", data = student_alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0917  -0.8297  -0.5871   1.0191   2.1975  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.72528    0.61497  -1.179 0.238246    
## failures     0.43511    0.22355   1.946 0.051610 .  
## absences     0.08514    0.02320   3.670 0.000242 ***
## sexM         0.86638    0.25849   3.352 0.000803 ***
## studytime   -0.32357    0.16400  -1.973 0.048505 *  
## G3          -0.03905    0.04017  -0.972 0.330952    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 401.52  on 364  degrees of freedom
## AIC: 413.52
## 
## Number of Fisher Scoring iterations: 4

We see that high alcohol use is statistically significantly associated with absences and with being male. We see men (“sexM”) in the coefficients because the model used women (“sexF”) as the baseline, so compared to that baseline, male students are significantly associated with high alcohol use. Study time has a weaker, negative association (p ≈ 0.049) and failures a borderline one (p ≈ 0.052). Grades, however, show no statistically significant association, so we’ll rerun the model without them.

model <- glm(high_use ~ failures + absences + sex + studytime, data = student_alc, family = "binomial")
summary(model)
## 
## Call:
## glm(formula = high_use ~ failures + absences + sex + studytime, 
##     family = "binomial", data = student_alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0891  -0.8278  -0.5908   1.0193   2.1343  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.15122    0.43431  -2.651 0.008033 ** 
## failures     0.50928    0.21127   2.411 0.015928 *  
## absences     0.08650    0.02322   3.725 0.000195 ***
## sexM         0.84863    0.25759   3.295 0.000986 ***
## studytime   -0.33939    0.16338  -2.077 0.037771 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 402.47  on 365  degrees of freedom
## AIC: 412.47
## 
## Number of Fisher Scoring iterations: 4

We see that the statistical significance of all four remaining variables increased. The intercept also became statistically significant, which makes the model more suitable for computing predictions in later steps. Let’s calculate odds ratios and their confidence intervals.

# Odds ratios (OR)
OR <- coef(model) %>% exp

# Confidence intervals (CI)
CI <- confint(model) %>% exp

# Print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                    OR     2.5 %    97.5 %
## (Intercept) 0.3162522 0.1336630 0.7367549
## failures    1.6640983 1.1072255 2.5501879
## absences    1.0903516 1.0441945 1.1440122
## sexM        2.3364412 1.4162568 3.8963237
## studytime   0.7122080 0.5124273 0.9741384

Odds ratios greater than 1 indicate that the explanatory variable has a positive association with the dependent variable (high_use). Therefore, being male, absences and failures have a positive association, while study time has a negative one (OR < 1). Looking at the confidence intervals, none of them contains the value 1, which corresponds to a p value below 0.05: for each variable, we can reject the null hypothesis that it has no relationship with high alcohol use at the 5 % significance level.
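One way to read the odds ratios: raising an OR to the power of a change in the predictor gives the odds multiplier for that change. For example, using the absences OR from the table above:

```r
# Odds multiplier for ten additional absences:
# OR_absences^10 = exp(10 * coefficient), roughly 2.5
1.0968598 ^ 10
```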

Based on these results, the hypothesis that high alcohol consumption results in lower grades, more absences, less study time and more failures is false when it comes to grades, which showed no significant association. Study time, however, has a negative association with high use, supporting the hypothesis of less study time, and high consumption has a positive association with absences and failures. I also hypothesised there would be a difference between men and women, and this proved correct. We’ll run the GLM one more time without study time, in order to increase the statistical significance and predictive power of our model.

model <- glm(high_use ~ failures + absences + sex, data = student_alc, family = "binomial")
summary(model)
## 
## Call:
## glm(formula = high_use ~ failures + absences + sex, family = "binomial", 
##     data = student_alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1550  -0.8430  -0.5889   1.0328   2.0374  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.94150    0.23129  -8.394  < 2e-16 ***
## failures     0.59759    0.20698   2.887  0.00389 ** 
## absences     0.09245    0.02323   3.979 6.91e-05 ***
## sexM         0.99731    0.24725   4.034 5.49e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 406.99  on 366  degrees of freedom
## AIC: 414.99
## 
## Number of Fisher Scoring iterations: 4
OR <- coef(model) %>% exp
CI <- confint(model) %>% exp
cbind(OR, CI)
##                    OR      2.5 %    97.5 %
## (Intercept) 0.1434892 0.08940913 0.2219242
## failures    1.8177381 1.21997630 2.7642305
## absences    1.0968598 1.05041937 1.1508026
## sexM        2.7109881 1.67967829 4.4367282

We see that the statistical significance of failures increased and the odds ratios of failures and sexM increased.

Prediction and cross-validation

Let’s compute some predictions on our data.

# Predict the probability of high_use
probabilities <- predict(model, type = "response")

# Add the predicted probabilities to student_alc
student_alc$probability <- probabilities

# Use the probabilities to make a prediction of high_use
student_alc <- mutate(student_alc, prediction = probability > 0.5)

# Tabulate the target variable versus the predictions
table(high_use = student_alc$high_use, prediction = student_alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   252    7
##    TRUE     78   33
# Tabulate the target variable versus the predictions
table(high_use = student_alc$high_use, prediction = student_alc$prediction) %>% prop.table %>% addmargins
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.68108108 0.01891892 0.70000000
##    TRUE  0.21081081 0.08918919 0.30000000
##    Sum   0.89189189 0.10810811 1.00000000

Our 2x2 table is a confusion matrix: it shows how well the predictions match the observations. The first table shows counts and the second shows proportions. In the first table we see that when high_use is FALSE, the model predicted FALSE 252 times (correct) and TRUE 7 times (incorrect). However, the model was not good at detecting actual high use: it predicted TRUE correctly only 33 times and FALSE incorrectly 78 times. The second table shows the same information as proportions of all cases: correct FALSE predictions make up 0.68 of the data, while correct TRUE predictions make up only 0.089.
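From the same confusion matrix we can also compute the overall training accuracy (a quick sketch, assuming the prediction column has been added as above):

```r
conf <- table(high_use = student_alc$high_use,
              prediction = student_alc$prediction)

# proportion of correct predictions: (252 + 33) / 370, about 0.77
sum(diag(conf)) / sum(conf)
```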

Let’s perform a 10-fold cross-validation.

# Define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# Compute the average number of wrong predictions in the (training) data
loss_func(student_alc$high_use, prob = student_alc$probability)
## [1] 0.2297297
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = student_alc, cost = loss_func, glmfit = model, K = 10)

# Average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2324324

The lower the average number of wrong predictions, the better. The training data gave an error of about 0.23, and the 10-fold cross-validation about 0.23 as well. This cross-validated error is slightly better than DataCamp’s 0.26. However, the difference may be due to different processing of the data: the data wrangling in DataCamp was not perfect, and I used the corrected version by Reijo Sund.


Chapter 4 - Clustering and classification

Overview of the data

Let’s download the data for this exercise.

library(MASS)
data("Boston")

The Boston dataset is titled “Housing Values in Suburbs of Boston”, so it includes variables that relate to housing values in the Boston area.

library(dplyr)
glimpse(Boston)
## Rows: 506
## Columns: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90…
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…

The dataset has 14 columns and 506 observations. These columns include things like

  • Crime rate (crim) per capita
  • Proportion of residential land zoned (zn) for lots over 25,000 sq.ft. (about 2,300 m²; Imperial units suck)
  • Nitrogen oxides concentration (nox), in parts per 10 million
  • Proportion of owner-occupied units built prior to 1940 (age)
  • Weighted mean of distances (dis) to five Boston employment centres
  • Full-value property-tax rate per $10,000 (tax)

And so on. Let’s look at some data summaries.

library(ggplot2);library(GGally)
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
ggpairs(Boston, upper = list(continuous = "cor_v1_5"))

There’s a lot going on in these summaries because of the number of variables. Several variables have skewed distributions: crime rates, for example, are mostly on the low side, and many areas cluster at high values of the black variable. The average number of rooms per dwelling (rm) is the only variable that looks approximately normally distributed. The Charles River dummy variable (chas) is binary, while the index of accessibility to radial highways (rad) is a discrete index with a large gap between its middle and top values. Some roughly linear relationships can be seen, e.g. between nox and age, as well as zn and dis.

Scaling the data

We’ll scale the data by subtracting each column’s mean from the column and dividing by its standard deviation. This is simple to do with the scale function.
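To see concretely what scale() does, here’s a quick sanity check on a toy vector (not part of the analysis itself):

```r
# scale() centres each column and divides by its standard deviation
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
all.equal(as.numeric(scale(x)), manual)  # TRUE
```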

Boston_scaled <- scale(Boston)

# New summaries of the data after converting to data frame
Boston_scaled <- as.data.frame(Boston_scaled)
summary(Boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
ggpairs(Boston_scaled, upper = list(continuous = "cor_v1_5"))

The shapes of the distributions didn’t change, but scaling changed the units so that every variable now has mean 0 and standard deviation 1. Next we’ll create a categorical variable of crime rate.

# Creating a quantile vector
quantiles <- quantile(Boston_scaled$crim)

# Creating the categorical variable
crime <- cut(Boston_scaled$crim, breaks = quantiles, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127

Looks good. We’ll add this new crime variable to the dataset and remove the old one.

# Remove crim from the dataset
Boston_scaled <- dplyr::select(Boston_scaled, -crim)

# Add crime
Boston_scaled <- data.frame(Boston_scaled, crime)

Next we’ll create training and test sets. We’ll randomly pick 80 % of the data for the training set, leaving the remaining 20 % for the test set.

# Randomly choose 80% of the scaled Boston rows
# (calling set.seed() first would make the split reproducible)
ind <- sample(nrow(Boston_scaled), size = nrow(Boston_scaled) * 0.8)

# Create a training set
train <- Boston_scaled[ind,]

# Create a test set
test <- Boston_scaled[-ind,]

Linear discriminant analysis

We’ll fit a linear discriminant analysis (the lda function from the MASS package) on the training set, with the categorical crime rate as the target and all other variables as predictors. For the sake of simplicity, we’ll plot only the first two discriminants.

lda_model <- lda(crime ~ ., data = train)
lda_model
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2574257 0.2475248 0.2301980 0.2648515 
## 
## Group means:
##                  zn      indus        chas        nox         rm        age
## low       1.0557183 -0.9684478 -0.08304540 -0.9195942  0.4425080 -0.9476028
## med_low  -0.1395066 -0.2931305  0.04263895 -0.5405360 -0.1238290 -0.3024954
## med_high -0.3655244  0.1734128  0.19334945  0.3830847  0.1909604  0.4040860
## high     -0.4872402  1.0170108 -0.01476176  1.0749031 -0.3743531  0.8079826
##                 dis        rad        tax     ptratio       black       lstat
## low       0.9399577 -0.6881287 -0.7214038 -0.47556488  0.37865374 -0.77565846
## med_low   0.2842639 -0.5615323 -0.4511004 -0.09262755  0.31137642 -0.13053567
## med_high -0.3597911 -0.3644158 -0.2885715 -0.31421797  0.05551374 -0.01672123
## high     -0.8576114  1.6392096  1.5148289  0.78203563 -0.82424628  0.80825471
##                 medv
## low       0.56235107
## med_low   0.01154643
## med_high  0.20547493
## high     -0.66082389
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.09848190  0.73643587 -0.80578885
## indus    0.03976686 -0.41778643  0.05123194
## chas    -0.07980895 -0.01857315  0.13726551
## nox      0.48075805 -0.55228589 -1.44297535
## rm      -0.11866804 -0.13368668 -0.22661954
## age      0.28666163 -0.36557583 -0.16859850
## dis      0.02444929 -0.26349594 -0.10688395
## rad      3.06738830  0.89081087 -0.19503976
## tax     -0.10713901  0.09315526  0.82611831
## ptratio  0.11439060  0.04937801 -0.22607945
## black   -0.13857858 -0.01349335  0.08509017
## lstat    0.26105874 -0.20735010  0.32649717
## medv     0.23804953 -0.31290314 -0.20776933
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9453 0.0411 0.0136
plot(lda_model, dimen = 2, col = as.numeric(train$crime), pch = as.numeric(train$crime))

We see that most of the high crime rate points separate from the rest. LD1 explains 94 % of the between-group variance.

We’ll use the LDA model for prediction. Before that, the correct crime classes are saved and then removed from the test set.

# Save the correct classes from test data
correct_classes <- test$crime

# Remove the crime variable from test data
test <- dplyr::select(test, -crime)

# Predict with LDA
lda_predict <- predict(lda_model, newdata = test)
table(correct = correct_classes, predicted = lda_predict$class)
##           predicted
## correct    low med_low med_high high
##   low       10      10        3    0
##   med_low    8      15        3    0
##   med_high   0      15       18    0
##   high       0       0        0   20

The model predicts the high crime rate class perfectly (20/20 correct), while the two middle classes are partly confused with each other, and low is the most difficult class: fewer than half of the low observations are classified correctly.
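Assuming the objects from the previous chunk are still in memory, the per-class accuracies can be derived from the cross-tabulation directly:

```r
# Recompute the confusion matrix and read accuracies off it
conf <- table(correct = correct_classes, predicted = lda_predict$class)
diag(conf) / rowSums(conf)   # proportion correct within each true class
sum(diag(conf)) / sum(conf)  # overall accuracy
```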

K-means algorithm

Next we’ll calculate distances between the observations. The dataset needs to be reloaded, after which we’ll scale the data and then calculate the distances.

data("Boston")
Boston_rescaled <- scale(Boston)

# Euclidean distances
euc_Boston <- dist(Boston_rescaled)
# Manhattan distances
man_Boston <- dist(Boston_rescaled, method = "manhattan")

summary(euc_Boston)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
summary(man_Boston)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618

Manhattan distances are systematically larger than Euclidean distances, since they sum the absolute coordinate differences instead of taking the square root of the summed squares.
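The difference between the two metrics is easy to see with a single pair of points (a toy example, separate from the Boston data):

```r
# Euclidean vs Manhattan distance for the points (0, 0) and (3, 4)
p <- rbind(c(0, 0), c(3, 4))
dist(p)                        # Euclidean: sqrt(3^2 + 4^2) = 5
dist(p, method = "manhattan")  # Manhattan: |3| + |4| = 7
```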

Next we’ll run K-means clustering and try to determine what is the best amount of centers for our data.

# Define a maximum and calculate the total within sum of squares
max <- 10
Twcss <- sapply(1:max, function(k){kmeans(Boston_rescaled, k)$tot.withinss})

# Visualise
qplot(x = 1:max, y = Twcss, geom = 'line')

There’s a sharp drop in the total within-cluster sum of squares when moving to two clusters, after which the curve flattens. We’ll use two as the optimal number of clusters.

# K-means clustering
km <- kmeans(Boston_rescaled, centers = 2)

# Plot the Boston dataset with clusters
pairs(Boston_rescaled, col = km$cluster)

Based on this, the two-cluster optimum didn’t come out of nowhere. The two clusters separate into their own groups for many variables, such as crim, indus, nox, black, lstat and dis. One notable exception is the Charles River dummy variable (chas), where the points are mixed when paired with e.g. age and the pupil-teacher ratio (ptratio). We can’t, however, infer from this alone what exactly creates the dissimilarity. That’s a whole other exercise.
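One caveat worth noting: kmeans() starts from random initial centers, so results can vary between runs. A common remedy (sketched here, not used above) is to fix the seed and use the nstart argument, which reruns the algorithm several times and keeps the best solution:

```r
# Rerun k-means 20 times from random starts and keep the best clustering
set.seed(123)
km_stable <- kmeans(Boston_rescaled, centers = 2, nstart = 20)
km_stable$tot.withinss
```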


Chapter 5 - Dimensionality reduction techniques

Overview of the human data

library(dplyr);library(ggplot2);library(GGally);library(corrplot)
summary(human)
##      edu_FM         labour_FM        edu_exp         life_exp    
##  Min.   :0.1717   Min.   :13.50   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:44.45   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :53.50   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :52.52   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:61.90   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :88.10   Max.   :20.20   Max.   :83.50  
##       GNI         mat_mort_ratio   adol_birthrate   rep_parliament 
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
ggpairs(human, upper = list(continuous = "cor_v1_5"))

cor(human) %>% corrplot

Our dataset contains human development indicators for different countries. All the variables are numeric, a few of them integer-valued. Looking at these graphical variable distributions we see that

  • In most countries, fewer women than men have secondary education, i.e. the female/male education ratio (edu_FM) is below one
  • More than half of our world’s women participate in the labour force (labour_FM)
  • Expected years of education is mostly over 10 years (edu_exp)
  • Life expectancy ranges from as low as 49 years to a high of 83 years (life_exp)
  • Gross national income (GNI) is strongly right-skewed, with a long tail of high-income countries
  • Maternal mortality ratio (mat_mort_ratio) is also strongly right-skewed
  • Adolescent birthrate (adol_birthrate) is right-skewed as well
  • In most countries, around 20 % of the people in parliament are women (rep_parliament)
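The right skew is also visible numerically: for a right-skewed variable the mean sits well above the median, as a quick check on GNI shows (assuming the human data frame from above):

```r
# Mean far above the median indicates a long right tail
c(mean = mean(human$GNI), median = median(human$GNI))
```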

If we look at the correlation plot, we see that for example, maternal mortality ratio is positively correlated with adolescent birthrate but negatively correlated with expected years of education, GNI and life expectancy. Parliamentary representation of women is slightly correlated with a higher amount of women participating in labour force, life expectancy and expected years of education. GNI is positively correlated with women in education, expected years of education and life expectancy but negatively correlated with maternal mortality ratio and adolescent birthrate.

PCA plots

We’ll draw a PCA plot with these unstandardised data.

pca <- prcomp(human)
biplot(pca, choices = 1:2, cex = c(0.6, 0.8))

It’s basically a mess. GNI is the only variable with a visible arrow, pointing left along PC1, the horizontal axis. Let’s take a look at the components themselves.

summary(pca)$importance
##                               PC1      PC2      PC3      PC4     PC5      PC6
## Standard deviation     18544.1640 185.6147 25.20641 15.37613 10.6753 3.766125
## Proportion of Variance     0.9999   0.0001  0.00000  0.00000  0.0000 0.000000
## Cumulative Proportion      0.9999   1.0000  1.00000  1.00000  1.0000 1.000000
##                             PC7       PC8
## Standard deviation     1.545704 0.1733653
## Proportion of Variance 0.000000 0.0000000
## Cumulative Proportion  1.000000 1.0000000

We see that PC1’s proportion of variance is basically 100 %. Doesn’t sound too realistic. We’ll standardise the data before making a new PCA plot.
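We can quantify why PC1 dominates: in unstandardised PCA the components chase variance, and GNI alone accounts for virtually all of the total variance in the data (a quick check, assuming human contains only numeric columns):

```r
# Share of total variance contributed by GNI alone
vars <- sapply(human, var)
round(vars["GNI"] / sum(vars), 4)  # very close to 1
```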

human_scaled <- scale(human)
pca_human <- prcomp(human_scaled)


s1 <- summary(pca)
s2 <- summary(pca_human)

pca_pr1 <- round(100*s1$importance[2, ], digits = 1)
pca_pr2 <- round(100*s2$importance[2, ], digits = 1)
pc_lab1 <- paste0(names(pca_pr1), " (", pca_pr1, "%)")
pc_lab2 <- paste0(names(pca_pr2), " (", pca_pr2, "%)")
biplot(pca, choices = 1:2, cex = c(0.6, 0.8), xlab = pc_lab1[1], ylab = pc_lab1[2])
PCA plot of human data (no standardisation). As previously mentioned, PC1 captures essentially 100 % of the variance between observations. The only visible arrow is gross national income (GNI), pointing left along PC1: because GNI’s variance is orders of magnitude larger than the other variables’, PC1 is essentially just GNI. As a description of the data as a whole, this plot is inconclusive.


biplot(pca_human, choices = 1:2, cex = c(0.6, 0.7), xlab = pc_lab2[1], ylab = pc_lab2[2])
PCA plot of human data (standardised). We no longer have a single principal component causing all the variance. PC1 accounts for 54.5 % of variance and PC2 accounts for 15.5 % of variance. Different variables go in groups, with parliamentary representation of women and the proportion of women in labour force pointing up, maternal mortality ratio and adolescent birthrate pointing to the right and GNI, expected years of education, life expectancy and proportion of women in secondary education pointing to the left. We also see what drives the countries to different sides of the plot. For example, maternal mortality ratio and adolescent birthrate are important variables for Gambia, proportion of women in labour force for Zimbabwe and parliamentary representation of women for Iceland.


As we see, the two plots are very different. PCA is sensitive to the relative scaling of the original variables: features with larger variance are interpreted as more important. The original data variables have very different variances, which we can see here:

var(human)
##                      edu_FM     labour_FM       edu_exp     life_exp
## edu_FM            0.0583897 -4.222245e-01     0.4071587     1.159753
## labour_FM        -0.4222245  2.574366e+02    -6.5651998   -37.076528
## edu_exp           0.4071587 -6.565200e+00     8.0670239    18.682196
## life_exp          1.1597535 -3.707653e+01    18.6821956    69.423283
## GNI            1928.1655568 -3.137263e+04 32883.4502723 96824.966255
## mat_mort_ratio  -33.8243423  1.251877e+03  -442.5512317 -1512.597377
## adol_birthrate   -5.2594013  1.947709e+02   -82.1542388  -249.778424
## rep_parliament    0.2182832  3.733809e+01     6.7240452    16.280088
##                          GNI mat_mort_ratio adol_birthrate rep_parliament
## edu_FM              1928.166  -3.382434e+01  -5.259401e+00      0.2182832
## labour_FM         -31372.631   1.251877e+03   1.947709e+02     37.3380930
## edu_exp            32883.450  -4.425512e+02  -8.215424e+01      6.7240452
## life_exp           96824.966  -1.512597e+03  -2.497784e+02     16.2800884
## GNI            343874461.958  -1.944698e+06  -4.243095e+05  19003.7563259
## mat_mort_ratio  -1944698.063   4.485483e+04   6.605745e+03   -217.6061751
## adol_birthrate   -424309.466   6.605745e+03   1.690201e+03    -33.4746473
## rep_parliament     19003.756  -2.176062e+02  -3.347465e+01    131.9683058
var(human_scaled)
##                     edu_FM  labour_FM    edu_exp   life_exp         GNI
## edu_FM          1.00000000 -0.1089031  0.5932516  0.5760299  0.43030485
## labour_FM      -0.10890308  1.0000000 -0.1440642 -0.2773393 -0.10544253
## edu_exp         0.59325156 -0.1440642  1.0000000  0.7894392  0.62433940
## life_exp        0.57602985 -0.2773393  0.7894392  1.0000000  0.62666411
## GNI             0.43030485 -0.1054425  0.6243394  0.6266641  1.00000000
## mat_mort_ratio -0.66093177  0.3684019 -0.7357026 -0.8571684 -0.49516234
## adol_birthrate -0.52941841  0.2952703 -0.7035649 -0.7291774 -0.55656208
## rep_parliament  0.07863528  0.2025733  0.2060816  0.1700863  0.08920818
##                mat_mort_ratio adol_birthrate rep_parliament
## edu_FM             -0.6609318     -0.5294184     0.07863528
## labour_FM           0.3684019      0.2952703     0.20257328
## edu_exp            -0.7357026     -0.7035649     0.20608156
## life_exp           -0.8571684     -0.7291774     0.17008631
## GNI                -0.4951623     -0.5565621     0.08920818
## mat_mort_ratio      1.0000000      0.7586615    -0.08944000
## adol_birthrate      0.7586615      1.0000000    -0.07087810
## rep_parliament     -0.0894400     -0.0708781     1.00000000

Looking at these tables it’s no wonder GNI was the only variable with a distinct arrow in the first PCA plot: its variance is huge compared to everything else. After standardisation the variances are all equal to one, which is what the diagonal of the second table (now a correlation matrix) shows. If we compare the standardised PCA plot to the previously shown correlation plot, the results make sense. For example, we already established that maternal mortality ratio and adolescent birthrate are correlated. These two variables, together with life expectancy, expected years of education, proportion of women in secondary education and GNI, load mostly on PC1, while labour force participation and parliamentary representation of women load mostly on PC2. Since PC1 accounts for 54.5 % of the whole variance, we can conclude that this health-education-wealth dimension creates the biggest differences between the countries. GNI, life expectancy, expected years of education and proportion of women in secondary education act as an “opposite force” to maternal mortality ratio and adolescent birthrate. More education goes with higher life expectancy, and if women gain access to secondary education, they are not as likely to stay home and give birth at a young age. Also, women are less likely to die in childbirth if they have access to secondary education. More education correlating with GNI makes sense, since a more educated population has access to more jobs and there’s more money flowing in the country.

The tea dataset

We’ll move on to the tea dataset from the FactoMineR package. Let’s explore the data.

library(FactoMineR);library(tidyr)
data(tea)

glimpse(tea)
## Rows: 300
## Columns: 36
## $ breakfast        <fct> breakfast, breakfast, Not.breakfast, Not.breakfast, b…
## $ tea.time         <fct> Not.tea time, Not.tea time, tea time, Not.tea time, N…
## $ evening          <fct> Not.evening, Not.evening, evening, Not.evening, eveni…
## $ lunch            <fct> Not.lunch, Not.lunch, Not.lunch, Not.lunch, Not.lunch…
## $ dinner           <fct> Not.dinner, Not.dinner, dinner, dinner, Not.dinner, d…
## $ always           <fct> Not.always, Not.always, Not.always, Not.always, alway…
## $ home             <fct> home, home, home, home, home, home, home, home, home,…
## $ work             <fct> Not.work, Not.work, work, Not.work, Not.work, Not.wor…
## $ tearoom          <fct> Not.tearoom, Not.tearoom, Not.tearoom, Not.tearoom, N…
## $ friends          <fct> Not.friends, Not.friends, friends, Not.friends, Not.f…
## $ resto            <fct> Not.resto, Not.resto, resto, Not.resto, Not.resto, No…
## $ pub              <fct> Not.pub, Not.pub, Not.pub, Not.pub, Not.pub, Not.pub,…
## $ Tea              <fct> black, black, Earl Grey, Earl Grey, Earl Grey, Earl G…
## $ How              <fct> alone, milk, alone, alone, alone, alone, alone, milk,…
## $ sugar            <fct> sugar, No.sugar, No.sugar, sugar, No.sugar, No.sugar,…
## $ how              <fct> tea bag, tea bag, tea bag, tea bag, tea bag, tea bag,…
## $ where            <fct> chain store, chain store, chain store, chain store, c…
## $ price            <fct> p_unknown, p_variable, p_variable, p_variable, p_vari…
## $ age              <int> 39, 45, 47, 23, 48, 21, 37, 36, 40, 37, 32, 31, 56, 6…
## $ sex              <fct> M, F, F, M, M, M, M, F, M, M, M, M, M, M, M, M, M, F,…
## $ SPC              <fct> middle, middle, other worker, student, employee, stud…
## $ Sport            <fct> sportsman, sportsman, sportsman, Not.sportsman, sport…
## $ age_Q            <fct> 35-44, 45-59, 45-59, 15-24, 45-59, 15-24, 35-44, 35-4…
## $ frequency        <fct> 1/day, 1/day, +2/day, 1/day, +2/day, 1/day, 3 to 6/we…
## $ escape.exoticism <fct> Not.escape-exoticism, escape-exoticism, Not.escape-ex…
## $ spirituality     <fct> Not.spirituality, Not.spirituality, Not.spirituality,…
## $ healthy          <fct> healthy, healthy, healthy, healthy, Not.healthy, heal…
## $ diuretic         <fct> Not.diuretic, diuretic, diuretic, Not.diuretic, diure…
## $ friendliness     <fct> Not.friendliness, Not.friendliness, friendliness, Not…
## $ iron.absorption  <fct> Not.iron absorption, Not.iron absorption, Not.iron ab…
## $ feminine         <fct> Not.feminine, Not.feminine, Not.feminine, Not.feminin…
## $ sophisticated    <fct> Not.sophisticated, Not.sophisticated, Not.sophisticat…
## $ slimming         <fct> No.slimming, No.slimming, No.slimming, No.slimming, N…
## $ exciting         <fct> No.exciting, exciting, No.exciting, No.exciting, No.e…
## $ relaxing         <fct> No.relaxing, No.relaxing, relaxing, relaxing, relaxin…
## $ effect.on.health <fct> No.effect on health, No.effect on health, No.effect o…
#First 18 variables
gather(tea[,1:18]) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 5))

# Last 18 variables
gather(tea[,19:36]) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 5))

Most of the variables are factors, with one integer variable (age). The dataset was constructed by asking 300 individuals (rows) how they drink tea and how they perceive tea products, along with some personal questions (altogether 36 questions, i.e. columns).

We’ll plot the data with MCA. First, we’ll subset a bit to make the analysis a bit easier.

# Column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "age_Q")
tea_time <- dplyr::select(tea, one_of(keep_columns))

mca <- MCA(tea_time, graph = FALSE)
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.313   0.266   0.249   0.205   0.184   0.175   0.168
## % of var.             13.434  11.404  10.689   8.791   7.883   7.511   7.218
## Cumulative % of var.  13.434  24.838  35.527  44.318  52.201  59.712  66.930
##                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12  Dim.13  Dim.14
## Variance               0.147   0.141   0.135   0.119   0.101   0.072   0.056
## % of var.              6.302   6.051   5.787   5.086   4.348   3.094   2.403
## Cumulative % of var.  73.231  79.282  85.069  90.155  94.503  97.597 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.068  0.005  0.002 |  0.048  0.003  0.001 | -0.471
## 2                  |  0.142  0.021  0.009 |  0.089  0.010  0.004 | -1.069
## 3                  | -0.199  0.042  0.033 | -0.254  0.081  0.053 | -0.600
## 4                  | -0.798  0.678  0.665 | -0.240  0.072  0.060 |  0.064
## 5                  | -0.199  0.042  0.033 | -0.254  0.081  0.053 | -0.600
## 6                  | -0.582  0.360  0.361 | -0.167  0.035  0.030 | -0.235
## 7                  | -0.190  0.039  0.022 | -0.036  0.002  0.001 | -0.411
## 8                  |  0.150  0.024  0.009 |  0.307  0.118  0.036 | -0.880
## 9                  |  0.268  0.076  0.026 |  0.900  1.016  0.290 |  0.175
## 10                 |  0.605  0.390  0.137 |  0.870  0.948  0.282 | -0.073
##                       ctr   cos2  
## 1                   0.296  0.106 |
## 2                   1.527  0.527 |
## 3                   0.481  0.297 |
## 4                   0.005  0.004 |
## 5                   0.481  0.297 |
## 6                   0.074  0.059 |
## 7                   0.226  0.103 |
## 8                   1.035  0.298 |
## 9                   0.041  0.011 |
## 10                  0.007  0.002 |
## 
## Categories (the 10 first)
##                       Dim.1    ctr   cos2 v.test    Dim.2    ctr   cos2 v.test
## black              |  0.750  7.377  0.184  7.421 |  0.467  3.365  0.071  4.618
## Earl Grey          | -0.389  5.179  0.273 -9.036 | -0.016  0.010  0.000 -0.367
## green              |  0.594  2.063  0.044  3.610 | -0.954  6.272  0.113 -5.800
## alone              | -0.096  0.316  0.017 -2.253 | -0.262  2.803  0.128 -6.183
## lemon              |  0.483  1.366  0.029  2.938 |  0.202  0.282  0.005  1.229
## milk               | -0.091  0.093  0.002 -0.813 |  0.315  1.307  0.026  2.811
## other              |  0.938  1.404  0.027  2.853 |  2.736 14.071  0.232  8.321
## tea bag            | -0.460  6.376  0.277 -9.096 | -0.154  0.842  0.031 -3.046
## tea bag+unpackaged |  0.174  0.504  0.014  2.031 |  0.719 10.150  0.236  8.400
## unpackaged         |  1.718 18.837  0.403 10.972 | -1.150  9.948  0.180 -7.346
##                       Dim.3    ctr   cos2 v.test  
## black              | -0.740  9.026  0.179 -7.322 |
## Earl Grey          |  0.334  4.788  0.201  7.751 |
## green              | -0.293  0.629  0.011 -1.778 |
## alone              | -0.047  0.098  0.004 -1.118 |
## lemon              |  1.264 11.746  0.197  7.685 |
## milk               | -0.379  2.011  0.038 -3.375 |
## other              | -0.957  1.836  0.028 -2.910 |
## tea bag            | -0.436  7.208  0.249 -8.627 |
## tea bag+unpackaged |  0.663  9.193  0.200  7.740 |
## unpackaged         |  0.330  0.873  0.015  2.107 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.275 0.154 0.216 |
## How                | 0.060 0.295 0.235 |
## how                | 0.484 0.334 0.259 |
## sugar              | 0.132 0.013 0.200 |
## where              | 0.528 0.632 0.233 |
## age_Q              | 0.402 0.169 0.354 |

There are altogether 14 dimensions. The first three dimensions each retain 10-13 % of the variance. The summary shows only the first 10 individuals (rows) with their contributions to the dimensions (ctr) and their squared correlations with the dimensions (cos2), and likewise for the first 10 categories. Lastly, we have the categorical variables and their squared correlations (eta2) with each dimension: the closer the value is to 1, the stronger the link between the variable and the dimension. The variable “where” has the strongest link to dimension 1, with “how” a fairly close second and “age_Q” third. Additionally, “where” has the strongest relationship with dimension 2, and “How” is linked more strongly with dimension 2 than with dimension 1.
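The same numbers can be pulled straight out of the MCA result object (field names as in FactoMineR’s return value):

```r
# Eigenvalues and percentages of variance for the first dimensions
head(mca$eig, 3)

# Squared correlations (eta2) between variables and the first three dimensions
round(mca$var$eta2[, 1:3], 3)
```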

# Plot function for MCA, hide individual observations
plot.MCA(mca, invisible = c("ind"))
## Warning: ggrepel: 3 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
MCA plot of the subsetted tea data. The closer two categories are, the more similar they are. Buying tea from specialised tea shops goes together with consuming unpackaged tea, which makes sense, since tea shops usually sell unpackaged tea. Both have a strong relationship with dimension 1. Green tea consumers are probably more likely to buy unpackaged tea from tea shops. Tea consumption also differs between the age groups. For example, 15-24-year-olds are quite close to tea bag, Earl Grey and chain stores. Presumably chain stores mostly sell tea bags (instead of unpackaged tea) and Earl Grey is a popular option. Also, chain-store Earl Grey tea bags are likely cheap, and 15-24-year-olds usually don't have that high of an income. 60+-year-olds are more likely to drink black tea. Drinking tea as it is (= alone) or with sugar lies close to the centre, which indicates that these habits vary enough across all subjects not to be pulled strongly towards any particular dimension. No sugar, milk and lemon are a bit farther from the centre, but still quite close. Consuming tea in some other way sits far from the other categories, suggesting a distinct group of people.



Chapter 6 - Analysis of longitudinal data

RATS data

The RATS study is a nutritional study on rats. The rats were assigned into three groups with different diets, and their body weight was recorded approximately weekly (week 7 includes two recordings) up until week 9. The researchers wanted to explore whether the growth profiles differ between the groups.

Let’s take a look at the data.

library(dplyr);library(tidyr);library(ggplot2)

glimpse(RATSL)
## Rows: 176
## Columns: 5
## $ ID     <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,…
## $ Group  <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, …
## $ WD     <fct> WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1…
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470…
## $ Time   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, …
ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))

The data frame has 5 columns and 176 rows. There are 16 rats in total: eight belong to group 1, four to group 2 and four to group 3. Each rat ID is repeated due to the weekly weight measurements. From the plot it seems group 1 rats have had the lowest body weight throughout the study, while groups 2 and 3 have gained more weight; however, these rats were also heavier to begin with. Group 2 has one rat whose body weight is high throughout the study, and groups 1 and 3 each have a rat whose body weight is low compared to the others in the group.

Let’s standardise the weights and make a new plot.

# Standardise weight
RATSL <- RATSL %>%
  group_by(Group) %>%
  mutate(stdweight=scale(Weight) ) %>%
  ungroup()

# Look at the new data
glimpse(RATSL)
## Rows: 176
## Columns: 6
## $ ID        <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2,…
## $ Group     <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, …
## $ WD        <fct> WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, …
## $ Weight    <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, …
## $ Time      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, …
## $ stdweight <dbl[,1]> <matrix[26 x 1]>
# Plot the data
ggplot(RATSL, aes(x = Time, y = stdweight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  scale_y_continuous(name = "standardized weight") +
  theme(legend.position = "none")

Now the weight increase looks similar in each group, and the "outliers" mentioned before stand out even more clearly.

Let’s look at the mean weight profiles.

# Extract number of measurement times
n <- RATSL$Time %>% unique() %>% length()

# Summary data with mean and standard error of weight by group and measurement times 
RATSS <- RATSL %>%
  group_by(Group, Time) %>%
  summarise(mean = mean(Weight), se = sd(Weight)/sqrt(n) ) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the `.groups` argument.
# Glimpse the data
glimpse(RATSS)
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ Time  <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 36, …
## $ mean  <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375, 2…
## $ se    <dbl> 4.589478, 3.947710, 3.460116, 4.100800, 3.333956, 3.552939, 3.30…
# Plot the mean profiles with standard error
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  geom_point(size=3) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
  theme(legend.position = c(0.8,0.8)) +
  scale_y_continuous(name = "mean(Weight) +/- se(Weight)") +
  theme(legend.position = "right")

There does seem to be a greater increase in weight in groups 2 and 3. Group 2 has a larger standard error than group 3, and group 1 has the lowest standard error.

Let’s see if there are still outliers after averaging weight.

# Create a summary data by group and ID with mean as the summary variable.
RATSL8S <- RATSL %>%
  group_by(Group, ID) %>%
  summarise(mean=mean(Weight) ) %>%
  ungroup()

# Glimpse the data
glimpse(RATSL8S)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean  <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.7273, 274.…
# Draw a boxplot of the mean versus group
ggplot(RATSL8S, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Weight)")
## Warning: `fun.y` is deprecated. Use `fun` instead.

From this plot we see there's one clear outlier in group 1, the lowest data point. Groups 2 and 3 also seem to have outliers, BUT those groups contain only four(!) rats each. That sample is so ridiculously small that removing anything from it is dangerous. Group 1 has eight rats, so removing one would still be feasible. Truth be told, though, the whole study design, with one group having twice as many rats as the others, is not good. For the sake of my sanity I'd rather remove four rats from group 1 to make the groups more comparable; the outlier will be one of the excluded rats.
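Before excluding, we can check which rat in group 1 is the outlier; a small sketch (output not shown), assuming `RATSL8S` is as created above:

```r
library(dplyr)

# The lowest group-mean weight in group 1 should be the outlier
# visible in the boxplot
RATSL8S %>%
  filter(Group == 1) %>%
  arrange(mean) %>%
  slice(1)
```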

# Exclude rats 1-4 from group 1 (the outlier, among others)
RATSL_exc <- RATSL8S %>%
  filter(ID != 1, ID != 2, ID != 3, ID != 4)

glimpse(RATSL_exc)
## Rows: 12
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID    <fct> 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean  <dbl> 269.4545, 274.7273, 274.6364, 265.4545, 440.8182, 452.7273, 454.…

Let’s perform t-tests on this dataset. A t-test compares two groups at a time, so we’ll subset the data to suit this test.

# Subsetting for t-tests
test1 <- RATSL_exc %>% filter(Group != 1)
test2 <- RATSL_exc %>% filter(Group != 2)
test3 <- RATSL_exc %>% filter(Group != 3)

# Two-sample t-tests
t.test(mean ~ Group, data = test1, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  mean by Group
## t = -1.1086, df = 6, p-value = 0.3101
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -131.79027   49.60845
## sample estimates:
## mean in group 2 mean in group 3 
##        484.7045        525.7955
t.test(mean ~ Group, data = test2, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  mean by Group
## t = -22.612, df = 6, p-value = 4.898e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -282.2922 -227.1623
## sample estimates:
## mean in group 1 mean in group 3 
##        271.0682        525.7955
t.test(mean ~ Group, data = test3, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  mean by Group
## t = -6.0255, df = 6, p-value = 0.0009433
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -300.3927 -126.8800
## sample estimates:
## mean in group 1 mean in group 2 
##        271.0682        484.7045

The difference between groups 2 and 3 is not significant, while group 1 differs significantly from both group 2 and group 3. Accordingly, the 95 % confidence interval includes 0 when comparing groups 2 and 3 but not when comparing group 1 against group 2 or 3, so in the latter cases we can reject the null hypothesis at the 5 % significance level.
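As a sanity check, the three pairwise comparisons above could also be run in a single call with a multiple-comparison correction; a hedged sketch (not part of the original analysis, output not shown):

```r
# Bonferroni-adjusted pairwise t-tests of mean weight between groups;
# pool.sd = FALSE gives separate two-sample tests like the ones above
pairwise.t.test(RATSL_exc$mean, RATSL_exc$Group,
                pool.sd = FALSE, p.adjust.method = "bonferroni")
```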

Before fitting the linear model, we’ll add the baseline weight to the data.

# Add the baseline from the original data as a new variable to the summary data
RATS_filtered <- RATS %>% filter(ID != 1, ID != 2, ID != 3, ID != 4)

RATSL_exc <- RATSL_exc %>%
  mutate(baseline = RATS_filtered$WD1)

# Fit the linear model with the mean as the response 
fit <- lm(mean ~ baseline + Group, data = RATSL_exc)

# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq   F value    Pr(>F)    
## baseline   1 164212  164212 1045.8116 9.109e-10 ***
## Group      2    700     350    2.2285    0.1701    
## Residuals  8   1256     157                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA indicates that, after adjusting for baseline weight, there is no significant difference between the groups in weight gain.

To summarise: mean weights differ significantly between group 1 and groups 2 and 3 (t-tests), but once baseline weight is accounted for, the group differences in weight gain are negligible (ANOVA).

BPRS data

In this study, 40 men were randomly assigned to two treatment groups and their progress was monitored with the brief psychiatric rating scale (BPRS). Measurements were made from before treatment (week 0) up until week 8. BPRS assesses symptom constructs (e.g. hostility, hallucinations), each rated from one to seven (1 = not present, 7 = extremely severe).

We’ll take a look at the dataset.

glimpse(BPRSL)
## Rows: 360
## Columns: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject   <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ weeks     <fct> week0, week0, week0, week0, week0, week0, week0, week0, week…
## $ bprs      <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
## $ week      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
ggplot(BPRSL, aes(x = week, y = bprs, group = subject)) +
  geom_line(aes(linetype=subject)) + facet_grid(. ~ treatment, labeller = label_both) +
  scale_x_continuous(name= "Weeks") + scale_y_continuous(name = "BPRS points") + theme(legend.position = "top") +
  theme(legend.position = "none")

We see the two treatment groups look fairly similar. Treatment group 2 has one outlier subject whose BPRS score is high most of the time.

We’ll fit two models. Both are random intercept and random slope models; the first is without interaction and the second includes the week × treatment interaction. We’ll see the effects of week, treatment and their interaction on the BPRS scores. Additionally, we’ll account for each subject as a random effect.

library(lme4)

# Fitting the models
BPRS_model1 <- lmer(bprs ~ week + treatment + (weeks | subject), data = BPRSL)

# With interaction
BPRS_model2 <- lmer(bprs ~ week * treatment + (weeks | subject), data = BPRSL)

# Model summaries
summary(BPRS_model1)
## Linear mixed model fit by REML ['lmerMod']
## Formula: bprs ~ week + treatment + (weeks | subject)
##    Data: BPRSL
## 
## REML criterion at convergence: 2720.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8750 -0.5665 -0.0880  0.5723  3.3855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr                               
##  subject  (Intercept) 69.49    8.336                                       
##           weeksweek1  12.05    3.472     0.26                              
##           weeksweek2  13.39    3.659    -0.22  0.63                        
##           weeksweek3  22.84    4.779    -0.27  0.83  0.89                  
##           weeksweek4  35.24    5.936    -0.59  0.61  0.58  0.85            
##           weeksweek5  44.38    6.662    -0.72  0.41  0.38  0.69  0.97      
##           weeksweek6  56.45    7.513    -0.64  0.45  0.32  0.67  0.96  0.99
##           weeksweek7  55.05    7.419    -0.44  0.57  0.23  0.64  0.91  0.93
##           weeksweek8  56.12    7.491    -0.32  0.65  0.24  0.66  0.89  0.88
##  Residual             93.17    9.652                                       
##             
##             
##             
##             
##             
##             
##             
##             
##   0.97      
##   0.93  0.99
##             
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  44.0875     1.9989  22.055
## week         -2.1522     0.2958  -7.275
## treatment2    0.5722     1.0174   0.562
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.578       
## treatment2 -0.254  0.000
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see ?isSingular
summary(BPRS_model2)
## Linear mixed model fit by REML ['lmerMod']
## Formula: bprs ~ week * treatment + (weeks | subject)
##    Data: BPRSL
## 
## REML criterion at convergence: 2717.2
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.9990 -0.5893 -0.0968  0.5262  3.5095 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr                               
##  subject  (Intercept) 69.91    8.361                                       
##           weeksweek1  12.17    3.488     0.25                              
##           weeksweek2  13.73    3.706    -0.22  0.64                        
##           weeksweek3  23.33    4.830    -0.27  0.83  0.89                  
##           weeksweek4  35.73    5.977    -0.59  0.61  0.58  0.85            
##           weeksweek5  44.85    6.697    -0.72  0.41  0.39  0.69  0.97      
##           weeksweek6  56.96    7.547    -0.64  0.46  0.33  0.67  0.96  0.99
##           weeksweek7  55.42    7.444    -0.44  0.57  0.23  0.64  0.91  0.93
##           weeksweek8  56.48    7.515    -0.32  0.65  0.24  0.66  0.89  0.88
##  Residual             92.44    9.614                                       
##             
##             
##             
##             
##             
##             
##             
##             
##   0.97      
##   0.93  0.99
##             
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      45.4954     2.1453  21.207
## week             -2.5082     0.3548  -7.069
## treatment2       -2.2911     1.8687  -1.226
## week:treatment2   0.7158     0.3925   1.824
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.650              
## treatment2  -0.436  0.465       
## wek:trtmnt2  0.366 -0.553 -0.840
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see ?isSingular

The slope for week is negative in the non-interaction model, suggesting that BPRS scores go down over time, and the small positive coefficient for treatment2 suggests treatment group 2 has slightly higher BPRS scores than group 1. In the interaction model the coefficients for week and treatment2 are negative, which on its own would suggest lower scores in treatment group 2; however, the positive week:treatment2 interaction indicates that scores decline more slowly in group 2, partly offsetting this. We were right to add subject as a random effect, since the subject-level variance is high in both models. Note also the "boundary (singular) fit" message in both summaries, which hints that the random-effects structure may be overparameterised.
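The singularity message can be checked directly, and approximate confidence intervals for the fixed effects give another view of the interaction term; a sketch (output not shown):

```r
library(lme4)

# TRUE would confirm the random-effects structure is (near-)degenerate
isSingular(BPRS_model2)

# Quick Wald confidence intervals for the fixed effects
# of the interaction model
confint(BPRS_model2, method = "Wald")
```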

# Comparison of the models with ANOVA
anova(BPRS_model1, BPRS_model2)
## Data: BPRSL
## Models:
## BPRS_model1: bprs ~ week + treatment + (weeks | subject)
## BPRS_model2: bprs ~ week * treatment + (weeks | subject)
##             npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_model1   49 2822.4 3012.8 -1362.2   2724.4                       
## BPRS_model2   50 2821.1 3015.4 -1360.5   2721.1 3.3344  1    0.06785 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

According to the ANOVA comparison, model 2 has a slightly smaller AIC but a slightly larger BIC than model 1, and the likelihood ratio test of model 1 against model 2 gives a chi-squared statistic of 3.33 with 1 degree of freedom (Df). The p-value (0.068) is not statistically significant at the 5 % level. These results do not clearly indicate which model fits our data better.

Either way it seems there are no statistically significant differences between the treatment groups.
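One possible follow-up, given the singular fits above, would be a simpler random-effects structure using the numeric week variable instead of the weeks factor; a hedged sketch, not fitted or compared here:

```r
library(lme4)

# Random intercept and random slope over numeric week, per subject
BPRS_model3 <- lmer(bprs ~ week * treatment + (week | subject), data = BPRSL)
summary(BPRS_model3)
```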

We’ll create new plots with the fitted values from both models and compare them to the observed values, even though we can’t say the models fit the data well.

# Create a vector of the fitted values
Fitted1 <- fitted(BPRS_model1)
Fitted2 <- fitted(BPRS_model2)

# Create a new column fitted to BPRSL
BPRSL <- BPRSL %>% mutate(fitted1 = Fitted1, fitted2 = Fitted2)

# Draw the plot of BPRSL with the observed BPRS values
ggplot(BPRSL, aes(x = week, y = bprs, group = subject)) +
  geom_line(aes(linetype=subject)) + facet_grid(. ~ treatment, labeller = label_both) +
  scale_x_continuous(name= "Weeks") + scale_y_continuous(name = "BPRS points") + theme(legend.position = "top") +
  theme(legend.position = "none")

# Draw the plots of BPRSL with the Fitted values of BPRS (model 1)
ggplot(BPRSL, aes(x = week, y = fitted1, group = subject)) +
  geom_line(aes(linetype = subject)) + facet_grid(. ~ treatment, labeller = label_both) +
  scale_x_continuous(name = "Weeks") +
  scale_y_continuous(name = "Fitted BPRS points") +
  theme(legend.position = "none")

# Model 2
ggplot(BPRSL, aes(x = week, y = fitted2, group = subject)) +
  geom_line(aes(linetype = subject)) + facet_grid(. ~ treatment, labeller = label_both) +
  scale_x_continuous(name = "Weeks") +
  scale_y_continuous(name = "Fitted BPRS points") +
  theme(legend.position = "none")

The two fitted plots are very similar to each other, so they don’t help distinguish the models. What we do see is that the fitted values capture the overall trend reasonably well, but they handle outliers poorly: one subject’s fitted BPRS scores jump up unexpectedly, and the subject with extreme observed scores is pulled down compared to the actual values.